Calculating variant penetrance from family history of disease and average family size in population-scale data

Background Genetic penetrance is the probability of a phenotype when harbouring a particular pathogenic variant. Accurate penetrance estimates are important across biomedical fields including genetic counselling, disease research, and gene therapy. However, existing approaches for penetrance estimation require, for instance, large family pedigrees or availability of large databases of people affected and not affected by a disease. Methods We present a method for penetrance estimation in autosomal dominant phenotypes. It examines the distribution of a variant among people affected (cases) and unaffected (controls) by a phenotype within population-scale data and can be operated using cases only by considering family disease history. It is validated through simulation studies and candidate variant-disease case studies. Results Our method yields penetrance estimates which align with those obtained via existing approaches in the Parkinson’s disease LRRK2 gene and pulmonary arterial hypertension BMPR2 gene case studies. In the amyotrophic lateral sclerosis case studies, examining penetrance for variants in the SOD1 and C9orf72 genes, we make novel penetrance estimates which correspond closely to understanding of the disease. Conclusions The present approach broadens the spectrum of traits for which reliable penetrance estimates can be obtained. It has substantial utility for facilitating the characterisation of disease risks associated with rare variants with an autosomal dominant inheritance pattern. The yielded estimates avoid any kinship-specific effects and can circumvent ascertainment biases common when sampling rare variants among control populations. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01142-7.


Table S4 - Estimation of the incidence of amyotrophic lateral sclerosis relative to frontotemporal dementia among people of European ancestry who harbour the pathogenic hexanucleotide GGGGCC repeat expansion of the C9orf72 gene
3.5. Table S5 - Comparison of unadjusted penetrance estimates derived for the case studies presented in Table 2 between the lookup table and maximum-likelihood approaches
3.6. Table S6 - Direction of change in R(X)^obs and penetrance estimates according to increases in variant frequency and weighting factor estimates

Penetrance calculation procedure
Here follows an outline of the present approach to penetrance estimation. This method is available as an R function (R Version 4.1.2) accessible at https://github.com/ThomasPSpargo/adpenetrance/.
Step 1: To calculate penetrance using this method, we must identify the rate at which one of the defined disease states (familial, sporadic, unaffected, affected) occurs in families harbouring the variant, sampled across a valid combination of two or three of these states (see Table 1). This rate is denoted R(X), where X can be any one of the four disease states for which variant information was provided.
Definitions:
Familial = more than one family member affected
Sporadic = only one family member affected
Unaffected = no family member affected
Affected = at least one family member affected (familial or sporadic not specified)

In Step 1, we determine R(X) as it is observed within input data, R(X)^obs. If known, R(X)^obs can be specified directly, alongside a corresponding indication of the states from which this estimate is derived. If the familial state is represented within input data, then state X is familial. If only the sporadic and unaffected states are represented, then state X is sporadic. If the affected and unaffected states are represented, then state X is affected.
R(X)^obs can also be derived as a weighted proportion of heterozygous variant frequency estimates drawn from samples of unrelated people from two or three of the familial, sporadic, and unaffected disease states, or from the affected and unaffected states. When variant frequency estimates for the familial or sporadic states are included, the frequencies of familial, P(F|A), and sporadic, P(S|A), disease among the affected population, A, must feature in the weightings; note that, as the familial and sporadic states are binary outcomes within the affected population, P(F|A) = 1 - P(S|A). Where the unaffected or affected groups are represented, the baseline (e.g. lifetime) risk of a population member being affected, P(A)^pop, must be included within the weightings.
In this weighted proportion calculation, we respectively denote variant frequencies for the familial, sporadic, unaffected, and affected states as M_F, M_S, M_U, and M_A, to be weighted by the factors W_F, W_S, W_U, and W_A. Given that representation of any two or three of the familial, sporadic, and unaffected disease states, or of the affected and unaffected states, can be used to estimate R(X)^obs, we let the represented states be arbitrarily denoted as the states X, Y, and Z. Accordingly, letting M_{F,S,U,A} and W_{F,S,U,A} arbitrarily be M_{X,Y,Z} and W_{X,Y,Z}:

R(X)^obs = (M_X × W_X) / (M_X × W_X + M_Y × W_Y)    (S1)

and, if data are given for the familial, sporadic, and unaffected disease states:

R(X)^obs = (M_X × W_X) / (M_X × W_X + M_Y × W_Y + M_Z × W_Z)    (S2)

Note that all 4 states cannot be specified together, as the familial and sporadic states are subsumed within the affected state. For this reason, it is also unsuitable to represent the affected state alongside data for either or both of the familial and sporadic states. Table 1 presents all possible disease state combinations and outlines how the associated weighting factors should be defined to calculate R(X)^obs.
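The weighted proportion described above can be illustrated with a short Python sketch. It is not the adpenetrance R implementation: the function name `observed_rate`, the input values, and in particular the weighting factors (taken here as W_F = P(A)pop × P(F|A), W_S = P(A)pop × P(S|A), W_U = 1 - P(A)pop) are assumptions made for illustration; Table 1 remains the authoritative definition of the weights.

```python
def observed_rate(freqs, weights):
    """Weighted proportion of state X, where X is the FIRST state listed.

    freqs   -- variant frequencies for states X, Y (and optionally Z)
    weights -- weighting factors for the same states, in the same order
    """
    terms = [m * w for m, w in zip(freqs, weights)]
    return terms[0] / sum(terms)


# Hypothetical inputs (not taken from any case study).
p_a = 0.01          # baseline risk of being affected
p_f = 0.10          # proportion of affected people classed as familial
p_s = 1 - p_f       # proportion classed as sporadic
m_f, m_s, m_u = 0.05, 0.01, 0.0001   # variant frequency per state

# ASSUMED weighting factors for the familial/sporadic/unaffected combination.
w_f, w_s, w_u = p_a * p_f, p_a * p_s, 1 - p_a

rx_obs = observed_rate([m_f, m_s, m_u], [w_f, w_s, w_u])
print(f"R(X)obs with X = familial: {rx_obs:.4f}")
```

Under these assumptions, raising the frequency or weight of state X raises the resulting rate, and raising those of the other states lowers it, which matches the direction of change summarised in Table S6.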
Step 2: A lookup table to which R(X)^obs can be compared for penetrance estimation is generated here. This table stores a series of R(X) values that would be expected at a given value of penetrance, f_i, in a population with average sibship size N, and (optionally) the residual disease risk, g, for people who do not harbour the variant, which can be calculated according to equations 9-11 but is assumed to be 0 by default. We denote this series of R(X) values as R(X)_i^ex. The sibship size N must be defined alongside the data provided for Step 1 and should represent the average sibship size of the sample from which R(X)^obs is determined.
Step 3: The R(X)^obs estimate obtained in Step 1 is used to query the lookup table generated in Step 2. The value of R(X)_i^ex closest to R(X)^obs is identified and the corresponding penetrance value is taken (see Supplemental Methods 1.2.1 and Table S5 for comparison to a maximum-likelihood approach). This value is an uncorrected penetrance estimate, f^unadjusted, which is subject to a systematic bias within the approach and should not therefore be taken as the final estimate; the final estimate is determined in Step 4. Note that R(X)^obs ≈ R(X)_i^ex unless R(X)^obs exceeds or is less than the rate of state X expected between f_i = 0, …, 1 at N and g.
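Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the adpenetrance implementation: it assumes a simple form of the disease model (one variant-harbouring parent, each of N sibs inheriting the variant with probability 0.5, and g = 0) in place of equations 5-7, which are not reproduced in this supplement, and all function names are hypothetical.

```python
def state_probs(f, n):
    """ASSUMED disease model with residual risk g = 0.

    One parent harbours the variant (affected with probability f); each of
    n sibs inherits it with probability 0.5 (affected with probability f/2).
    """
    sib_unaff = 1 - f / 2
    p_unaff = (1 - f) * sib_unaff ** n
    # Exactly one affected: the parent alone, or exactly one sib.
    p_spor = f * sib_unaff ** n + (1 - f) * n * (f / 2) * sib_unaff ** (n - 1)
    p_fam = 1 - p_unaff - p_spor
    return {"familial": p_fam, "sporadic": p_spor, "unaffected": p_unaff}


def expected_rate(f, n, states):
    """Expected rate of state X (the first entry of `states`) among `states`."""
    p = state_probs(f, n)
    return p[states[0]] / sum(p[s] for s in states)


def lookup_penetrance(rx_obs, n, states, steps=1000):
    """Return the grid penetrance whose expected rate is closest to rx_obs."""
    grid = [i / steps for i in range(1, steps + 1)]  # skip f = 0 (0/0 risk)
    return min(grid, key=lambda f: abs(expected_rate(f, n, states) - rx_obs))


f_unadjusted = lookup_penetrance(
    0.25, n=2, states=("familial", "sporadic", "unaffected"))
print(f_unadjusted)
```

The value returned corresponds to the unadjusted estimate only; the bias correction of Step 4 is not included in this sketch.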
Step 4: This step computes the final penetrance estimate to be returned by the method, f^adjusted. It corrects for systematic bias in the f^unadjusted estimate from Step 3, which diverges from the true penetrance value according to the combination of states modelled, the value of penetrance, and the structure of families sampled (see Fig S1; Fig S2).
In Step 4, firstly, a simulated dataset of 90,000 families is pseudo-randomly generated, where each simulated family is assigned a sibship size of the value N_i^(sim). The population generated in this step aims to approximate the sibship structure of the real population sampled for penetrance estimation. To ensure replicability, all pseudo-randomisation in this step is performed using the R seed 24.
By default, simulated sibships follow a Poisson distribution with the lambda defined by the mean sibship size, N, specified for the real sample data. Example simulated Poisson sibship distributions are presented in Fig S2 A.i-D.i. The Poisson distribution was selected as the default simulation distribution as it is a discrete probability distribution useful for estimating the number of events expected to occur within a given time frame. In this instance, an event is having a child (1 sib) and the time frame is the childbearing years for that family. We note that the Poisson distribution assumes the independence of events and that this assumption would not hold in the present instance (i.e. in real populations, the probability of having additional offspring will be influenced by having already had N_i offspring). However, Fig S1 demonstrates that the degree of error made in Step 3 penetrance estimates is comparable between the Poisson distribution (Panel A.ii) and other hypothetical population structures (Panels B.ii-D.ii), including the distribution shown in C.i, which resembles that of a UK 1974 population birth cohort 1 . Therefore, a simulated population in which sibship sizes follow a Poisson distribution can be considered sufficient for approximating the expected error in unadjusted penetrance estimates made using data from randomly sampled populations. This is corroborated by the results of the simulations presented in Supplemental Methods 1.2.3.
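As an illustration of the default simulation population, the following Python sketch draws 90,000 sibship sizes from a Poisson distribution whose lambda is the mean sibship size. It is a stand-in for the R code, not a copy of it: the sampler uses Knuth's multiplication method because the Python standard library has no Poisson generator, the mean of 1.84 is borrowed from the UK population described later, and seeding Python with 24 does not reproduce the random stream of R seed 24.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(24)   # fixed seed for replicability (Python stream only)
mean_sibship = 1.84       # example value for N
sibships = [sample_poisson(mean_sibship, rng) for _ in range(90_000)]

simulated_mean = sum(sibships) / len(sibships)
share_zero = sibships.count(0) / len(sibships)
print(f"simulated mean sibship size: {simulated_mean:.3f}")
print(f"share of families with 0 sibs: {share_zero:.3f}")
```

A Poisson population of this kind includes families of sibship size 0, which is why it suits randomly sampled populations better than samples that exclude particular sibship sizes.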
If the structure of sibships in the real sample is known, then the user can optionally supply the adpenetrance R function with either a vector containing all the sampled sibship sizes or a summary of the sibship distribution, declaring the sibship sizes contained in the sample and the proportion of the sample each sibship size represents. When sibship data are supplied, a simulated sibship distribution is generated based on these data, including only the sibship sizes represented and following their observed distribution. This 'tailored' simulation population will give more precise f^adjusted estimates than those obtained using the default Poisson distribution (see Supplemental Methods 1.2.3). However, the Poisson distribution is sufficiently precise for adjustment when the sibship distribution of the real data is unknown, under the assumption that population sampling is random and does not exclude families of a particular sibship size (e.g. families of sibship size 0 are not excluded).
A sequence of 25 penetrance values between 0.01 and 1 is also defined, representing true penetrance values of a simulated variant, f_i^true [...] Fig S2).
The fitted polynomial regression model is then used to predict error in the penetrance estimate made for the real dataset in Step 3 based on the value of f^unadjusted, [...]
[...] f^unadjusted estimate obtained in Step 3. These values are then adjusted as in Step 4 according to the fitted polynomial regression model, giving the final penetrance estimates at the confidence interval bounds.

Approach validation and testing
The R scripts used for approach validation are available within our GitHub repository: https://github.com/ThomasPSpargo/adpenetrance/.

Lookup table validation: an alternative maximum-likelihood approach
The unadjusted penetrance estimates obtained in Step 3, f^unadjusted, can also be derived following a maximum-likelihood approach. To validate the lookup table approach implemented, we additionally derived f^unadjusted estimates using Non-Linear Minimisation, leveraging the nlm and dbinom functions available within the R stats package (version 4.1.2) 3 .
We constructed this validation approach by defining a negative likelihood function which determines, under a binomial distribution, the likelihood of the specified R(X)^obs at a given f^unadjusted and N. Within this function, values of R(X)^obs are transformed into integers so that they represent a number of state X events across a certain number of trials (e.g. the rate 0.394 would be multiplied by three orders of magnitude, giving 394 events across 1000 trials). The probability function is defined using equations 5-7, and according to the states modelled in calculating R(X)^obs.
Non-Linear Minimisation was then applied to determine the most likely f^unadjusted given R(X)^obs, N, and g. The starting value for minimisation was defined as the f^unadjusted estimate previously determined via the Step 3 lookup approach.
This approach was applied to each of the case studies presented in Table 2, and we found negligible difference between the f^unadjusted estimates generated by non-linear minimisation and via the lookup table method (see Table S5). These findings support the validity of the lookup table approach. The alternative maximum-likelihood method was not adopted for penetrance calculation, to avoid potential issues in model convergence if starting values are not appropriately defined.
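The maximum-likelihood check can be sketched in Python as below. Several substitutions are made and should be read as assumptions: a golden-section search over penetrance stands in for R's nlm, the binomial log-probability is written out by hand in place of dbinom, the familial-state probability uses an assumed simple disease model (g = 0) rather than equations 5-7, and all function names are hypothetical.

```python
import math


def p_familial(f, n):
    """ASSUMED model (g = 0): probability that a family is familial."""
    s = 1 - f / 2
    p_unaff = (1 - f) * s ** n
    p_spor = f * s ** n + (1 - f) * n * (f / 2) * s ** (n - 1)
    return 1 - p_unaff - p_spor


def neg_log_lik(f, events, trials, n):
    """Negative binomial log-likelihood of `events` state-X families."""
    p = min(max(p_familial(f, n), 1e-12), 1 - 1e-12)  # keep log() finite
    log_choose = (math.lgamma(trials + 1) - math.lgamma(events + 1)
                  - math.lgamma(trials - events + 1))
    return -(log_choose + events * math.log(p)
             + (trials - events) * math.log(1 - p))


def golden_section_min(func, lo, hi, tol=1e-8):
    """Minimise a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2


rx_obs, trials, n_sibs = 0.394, 1000, 2
events = round(rx_obs * trials)   # 394 events across 1000 trials
f_hat = golden_section_min(
    lambda f: neg_log_lik(f, events, trials, n_sibs), 1e-6, 1.0)
print(f"maximum-likelihood penetrance: {f_hat:.4f}")
```

A bounded search of this kind needs no starting value, whereas a general-purpose minimiser does; the sketch is intended only to show the structure of the likelihood, not to replace the validated R scripts.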

Age-dependent penetrance: tolerance to age of sampling
The penetrance of a variant for an associated disease is determined within the present method according to R(X)^obs, N, and g. If age of disease onset varies across people harbouring the variant, then penetrance is also age-dependent. In a sample consisting only of families harbouring the variant, R(X)^obs will inherently vary over time as people from sampled families age and become affected. Accordingly, penetrance estimates would be lower at an earlier time of sampling, and would not accurately represent the true lifetime penetrance. This effect is demonstrated below within a simulation study (see Supplemental Materials 1.2.3, Fig S9). A lifetime penetrance estimate is therefore best obtained within this scenario when the people sampled are beyond the typical age of onset for the studied trait.
Within a second sampling scenario, where R(X)^obs is determined indirectly as a weighted proportion of a given disease state across variant frequency estimates (per equations S1 and S2) from samples of people with and without the variant across a valid combination of disease states, age-dependence will have a smaller effect upon estimation of lifetime penetrance. This is true if the variability in the rate at which family disease states change over time is comparable between families affected by disease where a variant of interest does and does not occur. To illustrate this assumption with an example: if at a given time 100 of 1000 people with sporadic disease harbour the variant of interest, the variant frequency is 0.1. Suppose then that at a later time of sampling, 200 people of the original sample are now considered 'familial'. If the rate of family disease state change is comparable for people with and without the variant over time, then roughly 180 people without and 20 with the variant would have been reassigned as familial. This leaves 80 of 800 people harbouring the variant in the sporadic sample, and the variant frequency remains 0.1. Accordingly, under this assumption, variant frequency estimates within a given disease state will be largely stable over time.
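The arithmetic of this example can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes, as the example does, that reassignment to the familial state occurs at the same rate in people with and without the variant.

```python
# Sporadic sample at the first time of sampling.
total_sporadic = 1000
with_variant = 100
freq_before = with_variant / total_sporadic

# At a later time, 200 people are reclassified as familial.
reassigned = 200
# Equal rate of change: reassignment is proportional to group size.
reassigned_with = round(reassigned * freq_before)
reassigned_without = reassigned - reassigned_with

remaining_with = with_variant - reassigned_with
remaining_total = total_sporadic - reassigned
freq_after = remaining_with / remaining_total

print(reassigned_with, reassigned_without)   # people moved, with / without
print(remaining_with, remaining_total)       # sporadic sample afterwards
print(freq_before, freq_after)               # variant frequency before / after
```

If reassignment were instead faster in one group, `reassigned_with` would no longer be proportional to group size and the two frequencies would diverge, which is the departure that the equal onset variability test is designed to detect.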
In practice, the rate of change over time is unlikely to correspond exactly between people with and without the variant. However, the assumption is reasonable for a disease with a heritable genetic basis when the tested variant is not thought to be indicative of an entirely distinct onset profile. Accordingly, whether the assumption is true will be influenced by two factors: (1) that variability in age of disease onset is comparable for people who will be affected in their lifetime with and without a given variant, and (2) that the number of disease occurrences (across the range of zero and two or more affected) within families is similar between the groups.
The first of these can be tested by comparing the age of disease onset profile for people with and without a given variant; if the groups have 'equal onset variability' over time, then the assumption is more likely met. The important aspect of this test is that people with and without the variant progress from being unaffected to affected at a similar rate across age; absolute differences in age of onset between groups (i.e., where a variant is associated with a younger/older disease phenotype) are tolerated. When equal onset variability is observed, change in R(X)^obs over time will be determined by differences in the number of disease occurrences within families between groups; its estimation will be less affected by age-dependence than when sampling only from families within the variant group.
To facilitate testing of equal onset variability, we have made available an additional R function within the ADPenetrance GitHub repository 4 : checkOnsetVariability. This function allows users to supply information regarding age of disease onset for two sample groups (with and without a given variant). The age of onset is then centred for each group by a chosen metric (e.g., mean or median), to enable (base R) plotting of either a density or cumulative density function which overlays onset variability for the two groups. In addition, the function calculates the relative difference in the span of time between the first and third quartiles of disease onset in each group. For example, if there is an 8-year interval between the first and third quartiles for onset among people with the variant, and a 10-year interquartile interval for people without it, then the relative difference is 10/8 = 1.25, indicating that the variability in disease onset is 1.25 times smaller among people with the variant, with less time taken to span the interquartile interval. This number is returned to users of checkOnsetVariability as a quantifiable indication of the scale of departure from equal onset variability. Values of approximately 1 indicate equal onset variability, values > 1 indicate that the onset interval is shorter for people in the variant group, and values < 1 indicate that the onset interval is protracted for people in the variant group. An example of plots returned using the checkOnsetVariability function is provided in Fig S4, which presents testing of equal onset variability in the ALS case studies modelled versus a 'no variant' ALS population, characterised by absence of variants in C9orf72 and SOD1.
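The relative difference in interquartile interval described above can be sketched in Python. This is not the checkOnsetVariability function itself: the function name, the example onset ages, and the choice of the 'inclusive' quantile method are all assumptions, and the R function may compute quartiles differently.

```python
import statistics


def relative_onset_variability(onset_with, onset_without):
    """Interquartile span without the variant divided by the span with it.

    Values near 1 suggest equal onset variability; values > 1 suggest a
    shorter onset interval among people with the variant.
    """
    q1_w, _, q3_w = statistics.quantiles(onset_with, n=4, method="inclusive")
    q1_wo, _, q3_wo = statistics.quantiles(onset_without, n=4,
                                           method="inclusive")
    return (q3_wo - q1_wo) / (q3_w - q1_w)


# Hypothetical ages of onset (years) for the two groups.
onset_with_variant = [50, 52, 54, 56, 58, 60, 62, 64, 66]
onset_without_variant = [45, 47.5, 50, 52.5, 55, 57.5, 60, 62.5, 65]

ratio = relative_onset_variability(onset_with_variant, onset_without_variant)
print(ratio)
```

Centring each group by its mean or median, as done for plotting, leaves this ratio unchanged because the interquartile span does not depend on location.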
The relative difference in onset variability returned by checkOnsetVariability can be supplied to a further function also available on GitHub 4 , simADPenetrance, which enables users to perform a simulation study that returns a plot which visualises how much a given degree of departure from the assumption may affect penetrance estimates according to sampling age. The plyr (version 1.8.7), ggplot2 (version 3.4.0) and reshape2 (version 1.4.4) packages are dependencies for simADPenetrance [5][6][7] .
We present figures from simulation studies, performed using the simADPenetrance function, which demonstrate the accuracy of lifetime penetrance estimation according to age of sampling and degree of departure from the test of equal onset variability. In these simulations, families containing the variant of interest are compared to a wider disease cohort of families without this variant and instead harbouring one of several other variants of varying penetrance. In Fig [...] The simulations demonstrate reasonable accuracy in penetrance estimation across time of sampling when the assumption is met, and tolerable stability when the assumption is violated by the tested degree of departure.
A full description of these simulation studies is provided subsequently (Section 1.2.3), and documentation for checkOnsetVariability and simADPenetrance is provided on GitHub 4 .

Simulation studies
Here we present the results of simulation studies conducted to test the validity of the 4-step approach outlined in Supplemental Methods 1.1. The studies described are split into 2 sets according to the methodology followed for generating simulated families. The simulated datasets used within all studies were generated pseudo-randomly in R with no set seed number, and g = 0 except where stated.
Across both sets of simulation studies, families were pseudo-randomly generated based on sibship distributions previously reported in two distinct samples (see Fig S3).
The first simulated population (henceforth: the UK population) resembles the sibship distribution across the UK population 1974 birth cohort at the end of their childbearing years (defined as 45 years of age) 1 . The families within this simulated dataset were each pseudo-randomly assigned a sibship size between 0 and 4 according to the probabilities observed in this cohort (see Fig S3), and the mean sibship size, N, is 1.84. The simulation population was modelled on these data because they describe the most recent birth cohort for which data are available at the completion of childbearing years and because the distribution is representative of a randomly sampled population. The distribution of sibship sizes across this cohort is comparable to other reported UK and USA birth cohorts 1,8 .
The second population (henceforth: the NS population) was simulated based on the distribution of sibship sizes reported for the Next Steps dataset, a longitudinal sample of children from England 9 . Simulated families were pseudo-randomly assigned a sibship size between 1 and 7 according to the probabilities observed in the Next Steps sample (see Fig S3), and N = 3.006. The simulation cohort was modelled on these data to illustrate the application of the method to a sample not fully representative of the population. In this case, the sample does not include families of sibship size 0.

Set 1:
In the first set of simulation studies, the performance of the method was tested on simulated populations containing 90,000 simulated families.
A series of ground truth penetrance values, f_i^true, was generated for testing within each study. For each f_i^true, families from the two simulated populations were generated as described above, and the familial, sporadic, and unaffected disease state probabilities expected at each of the occurring sibship sizes were calculated using equations 5-7. One of these three disease states was then pseudo-randomly assigned to each family with the probabilities expected in a family of that sibship size. Penetrance estimates, f_i^adjusted, were then made for the population simulated under the specifications of that study. In each study, to test the two estimate adjustment approaches allowed in Step 4, we estimated f_i^error firstly when the method is supplied no information about the distribution of sibship sizes in the sample data and secondly when this information is supplied. As described in Step 4 (see Supplemental Methods 1.1), the former condition adjusts [...]

Validation under correct parameter specification
We first tested the approach by examining the accuracy of penetrance estimates made using correctly specified input parameters in simulated UK and NS populations harbouring hypothetical variants with known true penetrance values. A sequence of 20 ground truth penetrance values was first defined, f_i^true = (0.05, 0.10, …, 1), and the populations were simulated as described above. To examine the influence of g, we simulated scenarios where g = (0, 0.001, 0.1). Penetrance estimates, f_i^adjusted, were made for these populations, defining N according to the mean sibship size of that sample (approximately 1.84 for the UK and 3.01 for the NS populations), and with R(X)^obs calculated across all possible disease state combinations. f_i^error was then determined. This simulation was repeated 5 times for each value of f_i^true, and the results are shown in Fig S5, averaged across repetitions to determine the mean f_i^error observed at each value of f_i^true, across each of the disease state combinations. These findings evidence the validity and accuracy of penetrance estimates generated via this approach. They also demonstrate the benefit of supplying information about the distribution of sibships in the sample data when this is known; this benefit is greater if the sample data do not accurately represent sibship sizes across the population (e.g., where the NS dataset contains no families of sibship size 0).

Simulation under incorrect parameter specification
Misspecification of sibship size: This simulation study examines the accuracy of penetrance estimates when the mean sibship size of sample populations is incorrectly defined. We simulate a wide range of misspecification for sibship size here, although it is likely that the degree of misspecification in N would be relatively small for any population-representative sample.
Several values of true penetrance were defined: [...] The results of these simulations are presented in Fig S6. The increased impact of misspecifying N upon penetrance estimates in the UK compared to the NS population reflects that the difference in disease state rates between a family of 0 sibs and a family of 1 sib is greater than between families of 1 and 2 sibs, or 2 and 3 sibs (etc.); this difference is illustrated in the original description of this disease model 10 . Accordingly, misspecified, and particularly underestimated, N will be more impactful on penetrance estimation in the UK population, which has a lower mean sibship size than NS, since variation in disease state rates is greater between individual family sizes when there are fewer sibs.
Misspecification of disease state rates: This simulation study examines the accuracy of penetrance estimates when R(X)^obs is incorrectly estimated. R(X)^obs can be supplied directly to the tool or estimated from variant frequency estimates and weighting factors when supplying any valid disease state combination (see Table 1). Estimates of R(X)^obs, and subsequently penetrance, increase alongside increases in M_X or W_X, and decrease alongside increases in M_{Y,Z} or W_{Y,Z}. Table S6 summarises the direction of change in R(X)^obs and associated penetrance estimates when values of each input parameter increase for each of the valid disease state combinations.
In this simulation study, several values of true penetrance were defined: f_i^true = (0.10, 0.25, 0.50, 0.75, 1.00). A sequence of values to represent the degree of error in disease state rate estimates was also specified: R(X)_i^modify = (−0.15, −0.10, …, 0.15). The UK and NS populations were simulated as before. R(X)^obs was calculated for a given f_i^true across each of the five possible disease state combinations, with the R(X)^obs value to be defined in penetrance estimation being adjusted across each value of R(X)_i^modify.

Simulation to test influence of g accuracy upon estimate accuracy
Here we examine how the importance of specifying residual disease risk varies for penetrance estimation according to the prevalence of the disease, reflected in increased g. We estimate penetrance when g is correctly specified and when it is assumed that g = 0. This is [...] The results of this simulation study are shown in Fig S8. They illustrate that when the disease is rare in the population, and g is therefore small, accounting for g is less critical for attaining accurate penetrance estimates. However, for more common diseases, this is essential.

Set 2:
This second set of simulation studies aims to test the influence of age of sampling upon the accuracy of penetrance estimation in phenotypes with age-dependent onset. Several simulation scenarios are presented.
In each simulation, several values of true penetrance were tested: f_i^true = (0.25, 0.50, 0.75, 1.00). As above, penetrance estimates were made and f_i^error, which in this simulation reflects the difference between the estimate and lifetime penetrance at each time of sampling, was then determined. Each simulation was repeated 3 times for each value of f_i^true, and the results were averaged across these triplicates.
As before, population structures were firstly generated by pseudo-randomly assigning each family a given sibship size, between 0 and 4 for the UK population and 1 and 7 for the NS sample according to the probabilities of each sibship size per population (see Fig S3).
For a given family of sibship size N_i, individual family members are then generated, consisting of two parents and N_i siblings. Family members are each assigned relative ages at the time of first sampling, where 0 indicates the final age before the simulated disease could onset in any person with or without the variant. The youngest of the N_i siblings is assigned age 0, and the other siblings are, using the rnorm function, pseudo-randomly assigned age differences of mean 3 (SD=0.75), which are then summed relative to the age of the next youngest sibling and rounded to the nearest integer. This produces N_i siblings separated by ~3 years of age. Each of the two parental ages is also assigned using rnorm. In a family with N_i = 0 or 1, 'parental' ages are generated with mean age 25 (SD=3), rounded to the nearest integer. If N_i > 1, the mean age is adjusted in line with the age of the oldest sibling (e.g., if the oldest sibling is 9, then the mean parental age is 34).
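The age assignment described above might be sketched in Python as follows, with `random.gauss` standing in for R's rnorm. The function name is hypothetical, and one detail is an assumption: the sketch accumulates the unrounded age gaps and rounds the resulting ages, whereas the text leaves open whether rounding happens before or after summation.

```python
import random


def simulate_family_ages(n_sibs, rng):
    """Return (sibling_ages, parent_ages) relative to the first sampling."""
    sib_ages = []
    if n_sibs > 0:
        raw_ages = [0.0]                      # youngest sibling is age 0
        for _ in range(n_sibs - 1):
            gap = rng.gauss(3, 0.75)          # age difference, mean 3, SD 0.75
            raw_ages.append(raw_ages[-1] + gap)
        sib_ages = [round(age) for age in raw_ages]

    # Mean parental age is 25, shifted by the oldest sibling's age when
    # there is more than one sibling (e.g. oldest sibling 9 -> mean 34).
    oldest = max(sib_ages) if n_sibs > 1 else 0
    parent_ages = [round(rng.gauss(25 + oldest, 3)) for _ in range(2)]
    return sib_ages, parent_ages


rng = random.Random(1)
sibs, parents = simulate_family_ages(4, rng)
print("sibling ages:", sibs)
print("parent ages:", parents)
```

Relative ages of 0 or below mean that a family member has not yet entered the simulated onset window at the first sampling.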
We simulate a disease which may onset across a 10-year period, where (as above) 0 represents the final age before disease could onset and 10 represents the age by which all disease occurrences have onset. We optionally allow the onset window to scale separately within this 10-year window according to variant status (whether or not the variant with penetrance f_i^true is harboured). To give an example scenario: all disease occurrences will onset between ages 1 and 10, but onset for people with the variant of penetrance f_i^true may be from ages 1 to 7, versus 1 to 10 in people not harbouring it. Letting the onset scale be distinct according to variant status enabled us to test the impact of deviation from equal onset variability (see Supplementary Methods 1.2.2). Except where specified, these simulations let disease risk scale equally and onset between times 1 and 10 for people with and without the variant of penetrance f_i^true.
Accordingly, age-dependent disease risk is defined as a proportion of the lifetime risk to an individual, according to their current age relative to the disease onset period and whether they harbour, do not harbour, or have 50% probability of inheriting the variant with lifetime penetrance f. The disease probability for an individual at relative age t is given by equations S4-S6. Equation S4 applies if they harbour the variant, and scales the lifetime penetrance f by the proportion of people with the variant who are affected by time point t. Equation S5 applies if the variant is absent, and scales the residual risk g by the proportion of people with residual risk who are affected by time point t. Equation S6 applies if they have 0.5 probability of inheriting the variant from a variant-harbouring parent. Equations S4-S6 mirror equations 2-4 of the main manuscript, with the integration of this age-dependent term.
Let the sampling times, t, run from the first sampling (t = 0) until and including the time at which the youngest family member reaches the final age for disease to onset. We simulate, using the rbinom function, whether each family member is affected at age t, according to the probability relevant to that person based on their variant status (variant harboured, variant absent, or 0.5 probability of inheritance) per equations S4-S6. We then sum the number of affected family members at each t, and define the family as 'unaffected' if no family member has disease at t, 'sporadic' if one family member has disease, or 'familial' if two or more family members have disease.
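The classification of a family at a single sampling time can be sketched in Python as below. The function names are hypothetical, a uniform draw compared against each probability stands in for R's rbinom with one trial, and the per-member disease probabilities are supplied directly rather than derived from equations S4-S6, which are not reproduced here.

```python
import random


def family_state(n_affected):
    """Map a count of affected family members to a family disease state."""
    if n_affected == 0:
        return "unaffected"
    if n_affected == 1:
        return "sporadic"
    return "familial"


def simulate_family_state(member_probs, rng):
    """Draw affected status per member (one Bernoulli trial each), then classify.

    member_probs -- disease probability at the current sampling time for
                    each family member (hypothetical inputs)
    """
    n_affected = sum(rng.random() < p for p in member_probs)
    return family_state(n_affected)


rng = random.Random(0)
# Two parents and two siblings with illustrative probabilities at one time.
print(simulate_family_state([0.30, 0.00, 0.15, 0.15], rng))
```

Repeating this draw at each sampling time, with probabilities that grow as members age through the onset window, yields the trajectory of family disease states used for penetrance estimation.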
Families generated across the simulated population are then combined. When the number of sampling points varies between families, disease state assignments at the final sampling time are duplicated for those families with fewer sampling points, until the number of sampling points is equal across the population. Penetrance is then estimated for each of the 5 possible disease state combinations at each time t.
Several simulation studies are now presented, which demonstrate the effect of age across several scenarios.

Age-dependence when sampling only families harbouring the variant of interest
As described in Supplemental Methods 1.2.2, R(X)^obs will vary greatly in traits with age-dependent onset according to age of sampling when calculated directly from the observed proportions of disease states across a cohort consisting only of people harbouring the variant. We simulate this scenario by generating a cohort of 100,000 families per the above method, where each family contains one variant-harbouring parent, one parent not harbouring the variant, and N_i siblings who each have a 50% chance of inheriting the variant. We estimate penetrance based on disease state proportions across the sample for each of the 5 possible disease state combinations at each sampling time t, from t = 0 to the final sampling time, representing the period across which the youngest sibling of each family could become affected. Fig S9 presents the results of this simulation. Penetrance estimates varied most when sampling includes the Familial state, since most Familial state occurrences will emerge across this time period. Sampling the Sporadic or Affected states relative to the Unaffected state shows a smaller degree of change, since the elder generation already have the maximum lifetime risk of disease by t = 0. Should R(X)^obs be estimated based on disease state proportions across a sample of only people harbouring the variant, we suggest that lifetime penetrance is best estimated based on people in the sample who have passed a typical age for disease onset, since family disease states can then reasonably be expected not to change further.

Age-dependence when sampling across families with or without variant across a disease cohort
Age-dependence will affect lifetime penetrance estimation less substantially when $R(X)^{obs}$ is estimated from variant frequencies within each disease state and weighting factors defined by the general characteristics of the disease (see Table 1).
We simulate this scenario by generating a general disease cohort across which only certain families harbour the variant of interest, which has a defined true lifetime penetrance. The variant of interest occurs within 100,000 of the generated families. A further 100,000 families are generated in which no family member harbours the variant of interest and one of several other variants with autosomal dominant inheritance for the disease occurs instead. Disease risks per age associated with these competing variants are generated as per Equations S4-S6, using the lifetime penetrance of those further variants. $R(X)^{obs}$ is then calculated per Table 1 and Equations S1 and S2 as a weighted proportion of the relevant variant frequency estimates observed at each t and the appropriate weighting factors. At all times, weighting factors were defined according to their value at the final sampling time.

Fig S10 displays the results of this simulation. After Step 4 error correction, and for true penetrance values of 0.25, 0.5 and 0.75, penetrance estimates diverged from the true penetrance by no more than 5% at most sampling times and disease state combinations. When true penetrance was 1.0, error was somewhat greater when sampling the familial, sporadic and unaffected states, or the familial and sporadic states, but estimates remained within a tolerable distance of true penetrance across all times of sampling. For all values of true penetrance, estimates were more accurate as age approached that of maximum lifetime risk.
In two further simulations, we modelled scenarios like the previous simulation, but with unequal onset variability between groups. Thus, the onset window for disease differed between people with the variant of interest and those with the competing variants (for an example, see Fig S4). In the first simulation, we let the onset window for people with the variant be 1.3 times shorter than for those without the variant (this is comparable to the relative difference in time spanned by the interquartile interval in people with ALS harbouring the C9orf72 variant compared to people with no C9orf72 or SOD1 variant; shown in Fig S4). Accordingly, in this simulation all families in which the variant of interest occurred reached their final disease state assignment by t = 8, as opposed to t = 10 for families where a competitor variant occurred. The results of this simulation are presented in Fig S11. In the second simulation, we test the inverse of the previous analysis, with a relative onset variability of 0.77 (≈1/1.3), letting the onset window instead be shorter for people harbouring competitor variants (who reach final family disease states by t = 8). The results of this simulation are given in Fig S12. In both simulations where the variability of disease onset differed between people with and without the variant, penetrance was estimated with tolerable accuracy across all ages and values of true lifetime penetrance. However, further departure from equal onset variability would have a greater impact upon penetrance estimation (see Supplementary Methods 1.2.2).

ADPenetrance: a companion web tool
This method of penetrance calculation is additionally available as an open-access web tool accessible at https://adpenetrance.rosalind.kcl.ac.uk. This was coded in R (Version 4.1.2) and leverages the R Shiny package (Version 1.7.3) [11]. An example of the interface and output of this tool is shown in Figure 2, as applied to estimation of SOD1 variant penetrance for ALS using data from a European sample as described in case study 3.
This tool can be used to calculate penetrance for a given variant based on an estimate of the observed rate of state X, $R(X)^{obs}$, a defined sibship size, and an estimate of the residual disease risk, g. State X is assigned to a particular state based on which disease states are included within input data, as indicated by the user. Those states represented can be any two or all three of the familial, sporadic, and unaffected states or the unaffected and affected states. If the familial state is represented within input data, then state X is familial. If only the sporadic and unaffected states are represented, then state X is sporadic. If the affected and unaffected states are represented, then state X is affected.
The user can derive $R(X)^{obs}$ independently, manually specifying the rate of the state requested by the tool. Alternatively, they can provide variant characteristics and weighting factors (see Table 1) in order to calculate $R(X)^{obs}$ as described in Step 1. These variant characteristics can be given in each disease state as either (1) variant counts and sample size among population-based samples or (2) directly as variant frequencies.
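The weighted-proportion idea behind this calculation can be sketched as below. The exact expressions are those given in Table 1 and Equations S1 and S2, which are not reproduced in this section, so the form used here is an assumption: each state's variant frequency is multiplied by a weighting factor for that state, and the observed rate of state X is the weighted value for X divided by the sum across the states sampled. The function name and the numeric inputs are hypothetical.

```python
def observed_state_rate(variant_freq, weight, state_x):
    """Weighted proportion of variant carriers falling in state_x.

    variant_freq: variant frequency observed within each sampled disease state
    weight: weighting factor for each sampled disease state (assumed form)
    """
    weighted = {state: variant_freq[state] * weight[state] for state in variant_freq}
    return weighted[state_x] / sum(weighted.values())


# Hypothetical inputs for a sample containing familial and sporadic states only.
rate = observed_state_rate(
    variant_freq={"familial": 0.5, "sporadic": 0.25},
    weight={"familial": 0.2, "sporadic": 0.8},
    state_x="familial",
)
```

With variant counts and sample sizes as input, the frequencies passed to such a function would simply be each count divided by its sample size.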
If data are given using variant counts and sample sizes for each disease state, then the error propagation step is included by default, deriving the standard error for each variant frequency from these values. If data are given using variant frequencies or if $R(X)^{obs}$ is provided directly, then the user can opt to provide error terms for the estimates specified in order to enable error propagation. Error terms can be given either as standard errors or as confidence intervals, from which standard errors are derived via z-score conversion. The user is asked to select which of these will be provided and, where confidence intervals are given, should indicate the level of confidence that these represent (95% confidence is assumed by default). Wherever error propagation is performed, the user will also need to specify the desired confidence level for the penetrance estimate output. This is selected from a series of options, and z-score conversion is used to transform the standard error of $R(X)^{obs}$ into the upper and lower confidence interval bounds of this estimate, which can then be used to estimate the bounds of the penetrance estimate.
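The z-score conversions mentioned above can be illustrated with a short sketch. It assumes a symmetric confidence interval based on the normal approximation; the function names are hypothetical and the tool itself is implemented in R rather than Python.

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided z-score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def se_from_ci(lower: float, upper: float, confidence: float = 0.95) -> float:
    """Standard error implied by a symmetric confidence interval."""
    return (upper - lower) / (2 * z_score(confidence))


def ci_from_se(estimate: float, se: float, confidence: float = 0.95):
    """Lower and upper confidence bounds around an estimate."""
    z = z_score(confidence)
    return estimate - z * se, estimate + z * se


# Hypothetical interval supplied by a user at the default 95% level.
se = se_from_ci(lower=0.10, upper=0.30)
lower, upper = ci_from_se(estimate=0.20, se=se)
```

The same pair of conversions covers both directions described in the text: from a supplied interval to a standard error for propagation, and from a propagated standard error back to interval bounds at the confidence level chosen for the output.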
The user must also indicate the average sibship size, N, across the sample set. This can be specified either manually or by querying a repository of Total Fertility Rate estimates across many world regions, which we have integrated within the tool [12].
The term g is assumed to equal 0 by default and can optionally be specified to indicate residual disease risk for people within sampled families who do not harbour the tested variant. This term is important for more common phenotypes (e.g., where g > 0.01) but will have less influence upon penetrance estimation when g ≈ 0, as would be the case for rare traits.
Once input data are specified, the tool can be operated, and the expected rate of state X, $R(X)^{exp}_f$, is calculated for all penetrance values f between 0 and 1 in increments of 0.0001. Penetrance is then estimated as in Steps 3 and 4 and a results table is produced.
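The lookup described above can be sketched for one configuration. The sketch assumes that state X is familial with all three states sampled, that g = 0, that each family has one variant-harbouring parent with lifetime risk f and N siblings who each have risk f/2, and that a family is familial when two or more members are affected and sporadic when exactly one is affected. It covers only the lookup and matching; the Step 4 error correction is not included. The function names are hypothetical.

```python
def expected_familial_rate(f: float, n_sibs: int) -> float:
    """Expected rate of the familial state at penetrance f under the stated assumptions."""
    sib_risk = f / 2  # 50% chance of inheriting the variant, then risk f
    p_no_sib = (1 - sib_risk) ** n_sibs
    p_one_sib = n_sibs * sib_risk * (1 - sib_risk) ** (n_sibs - 1)
    p_unaffected = (1 - f) * p_no_sib
    p_sporadic = f * p_no_sib + (1 - f) * p_one_sib
    return 1 - p_unaffected - p_sporadic


def estimate_penetrance(observed_rate: float, n_sibs: int, step: float = 0.0001) -> float:
    """Return the grid value of f whose expected rate is closest to the observed rate."""
    n_steps = round(1 / step)
    grid = [i * step for i in range(n_steps + 1)]
    return min(
        grid, key=lambda f: abs(expected_familial_rate(f, n_sibs) - observed_rate)
    )


# Hypothetical input: observed familial rate of 0.25 with a sibship size of 2.
f_hat = estimate_penetrance(observed_rate=0.25, n_sibs=2)
```

Because the expected familial rate increases monotonically with f under these assumptions, taking the nearest grid value gives a unique match for any observed rate within the attainable range.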